Mini Project 1

Author

Harry Sohal

Fiscal Analysis of Transit Systems


Intro

This analysis looks to provide some insights on the financial information behind some of the biggest transit agencies in the US. The data set being studied includes massive systems like the NYC Metro and also lesser used systems in states like Hawaii and Oklahoma.

Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
`summarise()` has grouped output by 'NTD ID', 'Agency Name'. You can override using the `.groups` argument.
Rows: 3744 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Agency, City, State, NTD ID, Organization Type, Reporter Type, UZA...
dbl  (2): Report Year, UACE Code
num (10): Primary UZA Population, Agency VOMS, Mode VOMS, Vehicle Operations...
lgl  (7): Vehicle Operations Questionable, Vehicle Maintenance Questionable,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'NTD ID'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'NTD ID', 'Agency', 'UZA Name', 'Mode', '3 Mode'. You can override using the `.groups` argument.

Task One & Two: Data Cleaning

After loading the data certain column names and row values need to be changed for syntactic reasons and ease of understanding. Firstly, changing the column UZA Name to metro_area will allow for syntactic ease, and a more intuitive label.

USAGE <- inner_join(TRIPS, MILES) |>
    mutate(`NTD ID` = as.integer(`NTD ID`))
Joining with `by = join_by(`NTD ID`, Agency, `UZA Name`, Mode, `3 Mode`,
month)`
    sample_n(USAGE, 1000) |> 
    rename("metro_area" = "UZA Name") %>% #renaming UZA
    mutate(month=as.character(month))
# A tibble: 1,000 × 8
   `NTD ID` Agency                 metro_area Mode  `3 Mode` month    UPT    VRM
      <int> <chr>                  <chr>      <chr> <chr>    <chr>  <dbl>  <dbl>
 1    70015 City of Wichita        Wichita, … DR    Bus      2010… 6.25e3 6.23e4
 2    40009 City of Fayetteville   Fayettevi… MB    Bus      2022… 1.63e5 8.74e4
 3    10064 Greater Attleboro-Tau… Providenc… MB    Bus      2021… 2.68e4 1.02e5
 4    50146 Madison County Transi… St. Louis… DR    Bus      2010… 7.17e3 6.73e4
 5    90226 Imperial County Trans… El Centro… MB    Bus      2021… 2.46e4 4.25e4
 6    90013 Santa Clara Valley Tr… San Jose,… DR    Bus      2011… 6.53e4 5.11e5
 7    50052 South Bend Public Tra… South Ben… DR    Bus      2024… 4.92e3 2.39e4
 8    30007 Greater Roanoke Trans… Roanoke, … DR    Bus      2013… 4.99e3 4.53e4
 9    40022 Metropolitan Atlanta … Atlanta, … HR    Rail     2015… 5.89e6 1.82e6
10    50057 Rock Island County Me… Davenport… MB    Bus      2016… 2.23e5 1.73e5
# ℹ 990 more rows

The second section will be “re-coding” the Mode date in the dataframe. Since the data originally had codes like “HR” for Heavy Rail, or “FB” for Ferry Boat it will be hard for someone ignorant of these codes to understand the meaning behind them. So we will find all the distinct values, search their meanings, and then rename those values. The results will be displayed a Data Table using the DT library

  distinct_modes<- USAGE %>% distinct(Mode)
 
   USAGE <- USAGE |>
     mutate(Mode=case_when(
          Mode == "HR" ~ "Heavy Rail", 
          Mode == "DR"~"Demand Response",
          Mode == "FB"~"Ferryboat",
          Mode == "MB"~"Bus",
          Mode == "SR"~"Streetcar Rail",
          Mode == "TB"~"Trolleybus",
          Mode == "VP"~"Vanpool",
          Mode == "CB"~"Commuter Bus",
          Mode == "RB"~"Bus Rapid Transit",
          Mode == "LR"~"Light Rail",
          Mode == "YR"~"Hybrid Rail",
          Mode == "MG"~"Monorail Automated Guideway",
          Mode == "CR"~"Commuter Rail",
          Mode == "AR"~"Alaska Railroad",
          Mode == "TR"~"Aerial Tramway",
          Mode == "IP"~"Inclined Plane",
          Mode == "PB"~"Publico",
          Mode == "CC"~"Cable Car",
          TRUE~"Unknown"))
    Use_table <- DT::datatable(USAGE)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Task 3: Answering Questions Using dyplr

What transit agency had the most total VRM in our data set?

Using the arrange function we can sort the data based on a ascending or descending order of a specified value. Using arrange combined with desc function we get the all the VRM values starting from highest to lowest. The corresponding agency will be the answer.

Max_VRM <- USAGE %>% 
    select(VRM,Agency) %>%
    arrange(desc(VRM)) %>% 
    head(1)
    print(Max_VRM)
# A tibble: 1 × 2
       VRM Agency                   
     <dbl> <chr>                    
1 30882144 MTA New York City Transit

What transit mode had the most total VRM in our data set?

Using the same idea as the previous question, this time combining it with the group_by function will allow us to find the answer. The group_by function creates “groups” of a given variable to then perform some action to those groups. In this case we will group by the mode and then sum up the VRM for each mode. This method will provide the answer “Bus”

vrm_by_mode <- USAGE %>% 
    select(VRM,Mode) %>%
    group_by(Mode) %>% 
    summarise(Total = sum(VRM)) %>% 
    arrange(desc(Total))
    print(vrm_by_mode)
# A tibble: 18 × 2
   Mode                              Total
   <chr>                             <dbl>
 1 Bus                         49444494088
 2 Demand Response             17955073508
 3 Heavy Rail                  14620362107
 4 Commuter Rail                6970644241
 5 Vanpool                      3015783362
 6 Light Rail                   2090094714
 7 Commuter Bus                 1380948975
 8 Publico                      1021270808
 9 Trolleybus                    236840288
10 Bus Rapid Transit             118425283
11 Ferryboat                      65589783
12 Streetcar Rail                 63389725
13 Monorail Automated Guideway    37879729
14 Hybrid Rail                    37787608
15 Alaska Railroad                13833261
16 Cable Car                       7386019
17 Inclined Plane                   705904
18 Aerial Tramway                   292860

How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?

To answer this the use of the filter function is required. The filter function will remove all the data that doesn’t agree with the filter statement, allowing for more specific analysis. Filtering for a May 2024 date and a NYC MTA Agency Subway will give the answer, but first some data manipulating is required. Creating a year and month column with the lubridate library, allows for quick and easy date time analysis and will come in handy, making future analysis streamlined. After doing so the previous techniques are used.

  typeof(USAGE$month)#checking the type of column month is
[1] "double"
library(lubridate)
  USAGE$month <- ymd(USAGE$month) #changing to date format
  USAGE <- USAGE %>% 
     rename("trips"=UPT) %>% #renaming the UPT(Unlinked passenger trips)- to trips
     rename("date"=month) %>%  
     mutate("year"= year(date)) %>%
     mutate('month'= month(date))
  nyc_trips <- USAGE %>% 
    filter(Agency == "MTA New York City Transit" & Mode =="Heavy Rail")%>% 
    filter(year==2024 & month == 5) %>% 
    summarise(may_total=sum(trips))
    print(nyc_trips)
# A tibble: 1 × 1
  may_total
      <dbl>
1 180458819

How much did NYC subway ridership fall between April 2019 and April 2020?

Using all the techniques from the previous questions, along with the pull method to give a singular value will provide an answer for this question. After pulling the data for both cases we can simply find the difference.

  nyc_monthly <- USAGE %>% #this table provides info that might be useful later on
    filter(Agency == 'MTA New York City Transit' & Mode == 'Heavy Rail') %>% 
    group_by(year, month) %>%                       
    mutate(total_vrm = sum(VRM)) %>% 
    mutate(total_trips = sum(trips)) %>%
    select(year,month,total_trips,total_vrm)
   vrm_2019 <- nyc_monthly %>%
    filter(year == 2019 & month == 4) %>% 
    pull(total_vrm)
  
  vrm_2020 <- nyc_monthly %>%
    filter(year == 2020 & month == 4) %>% 
    pull(total_vrm)
  
  print("VRM Difference")
[1] "VRM Difference"
  vrm_2019-vrm_2020 #difference in Vehicle Revenue Miles
[1] 12200677
  trips_2019 <- nyc_monthly %>%
    filter(year == 2019 & month == 4) %>% 
    pull(total_trips)
  
  trips_2020 <- nyc_monthly %>%
    filter(year == 2020 & month == 4) %>% 
    pull(total_trips)
  print("Trips Difference")
[1] "Trips Difference"
  trips_2019-trips_2020 #decrease in trips
[1] 211969660

Task 4:

What month had the highest VRM in 2023

Using a combintion of techniques like filtering, summarize, as well as grouping by the month column will provide the answer.

     USAGE %>% 
     filter(year == 2023) %>% 
     select(year,month,VRM) %>% 
     group_by(month) %>% 
     summarise(total_vrm = sum(VRM)) %>% 
     arrange(desc(total_vrm))
# A tibble: 12 × 2
   month total_vrm
   <dbl>     <dbl>
 1     8 422002435
 2    10 420684229
 3     3 413311032
 4     5 407605645
 5    11 402542477
 6     6 400875583
 7    12 399428041
 8     7 398685473
 9     9 398172153
10     1 389141732
11     4 386876164
12     2 361882307

What year had the most trips?

Similarly with this question grouping by year and computing the desired value, using mutate this time, gives us the answer. We can also plot this to see some yearly trends using ggplot.

  yearly_trips <- USAGE %>% 
     group_by(year) %>% 
     mutate(total_trips = sum(trips)) %>%
     mutate(avg_trip_day = total_trips/365) %>% 
     select(year,total_trips,avg_trip_day) %>% 
     arrange(desc(total_trips)) %>% 
     distinct(year,total_trips,avg_trip_day)
     #plotting
     ggplot(yearly_trips,aes(x= year,y=avg_trip_day))+
       geom_line(color = 'blue')+
       labs(title = "Trends in Avg Trips Per Day", y = "Unlinked Trips",x = "Years")+
       theme_minimal()

Predictably, there is a huge drop in 2020 in average trips.

What area has the highest/lowest trips, excluding the NYC area?

To answer this we can group by metro area and then summarize for the mean of area. After generating a dataframe with these values we can see Chicago was the highest followed by Las Vegas area, and the lowest was Cheyenne,WY. We can also create a bar graph of the top 10 areas to visualize this clearly. Using the aes (aesthetic) function with fill = metro_area allows for a colorful representation of each chart. This method creates a legend we don’t need so we set show.legend = FALSE. The element text function allows for the X label to be angled and positioned correctly for long label names.

   USAGE <- USAGE %>% rename("metro_area" = "UZA Name")#had to rerun this code
 
   trips_by_area <- USAGE %>% 
        select(year,metro_area,trips) %>% 
        group_by(metro_area) %>% 
        summarise(avg_trips = mean(trips)) %>% 
        arrange(desc(avg_trips)) %>% 
        head(10)

    ggplot(trips_by_area,aes(x = metro_area,y = avg_trips))+
        geom_bar(stat = 'identity',aes(fill = metro_area),color = "red",show.legend = FALSE)+
        theme_minimal()+
        theme(axis.text.x = element_text(angle = 45, hjust = 1,vjust = 1))+
        labs(title = "Average Trips by City",y= "Unlinked Passenger Trips", x= "Metro Area")

Task 5: Table Summarization

This task pertains to joining two tables to create a Usage and financial table. We will join on the keys NTD ID and Mode. Since the mode is different in the financial table (recall it was altered for usability) it must also be changed. Grouping by NTD ID and Mode allows for us answer the questions in Task 6 which requires transportation agency and method to be combined, since each agency has its own unique ID this is exactly the same thing.

  USAGE_2022_ANNUAL <- USAGE %>% 
        filter(year == 2022) %>% 
        select('NTD ID',Mode,Agency,metro_area,trips,VRM) %>% 
        group_by(`NTD ID`,Mode) %>% 
        summarise(annual_trips = sum(trips),annual_vrm = sum(VRM)) %>% 
        ungroup()
`summarise()` has grouped output by 'NTD ID'. You can override using the
`.groups` argument.
      Financials <- Financials |>
     mutate(Mode=case_when(
          Mode == "HR" ~ "Heavy Rail", 
          Mode == "DR"~"Demand Response",
          Mode == "FB"~"Ferryboat",
          Mode == "MB"~"Bus",
          Mode == "SR"~"Streetcar Rail",
          Mode == "TB"~"Trolleybus",
          Mode == "VP"~"Vanpool",
          Mode == "CB"~"Commuter Bus",
          Mode == "RB"~"Bus Rapid Transit",
          Mode == "LR"~"Light Rail",
          Mode == "YR"~"Hybrid Rail",
          Mode == "MG"~"Monorail Automated Guideway",
          Mode == "CR"~"Commuter Rail",
          Mode == "AR"~"Alaska Railroad",
          Mode == "TR"~"Aerial Tramway",
          Mode == "IP"~"Inclined Plane",
          Mode == "PB"~"Publico",
          Mode == "CC"~"Cable Car",
          TRUE~"Unknown"))
      
      USAGE_AND_FINANCIALS <- left_join(USAGE_2022_ANNUAL, 
           Financials, 
           join_by(`NTD ID`, Mode)) |>
           drop_na()

Task 6: Farebox Recovery Among Major Systems

Which transit system (agency and mode) had the most UPT in 2022?

Using Group_by and summarize with max( ) we get the answer New York MTA and Heavy Rail.

      USAGE_AND_FINANCIALS %>% 
        group_by(`Agency Name`,`Mode`) %>% 
        summarise(max_trips = max(annual_trips)) %>% 
        arrange(desc(max_trips)) %>% 
        head(1)
# A tibble: 1 × 3
# Groups:   Agency Name [1]
  `Agency Name`             Mode        max_trips
  <chr>                     <chr>           <dbl>
1 MTA New York City Transit Heavy Rail 1793073801

Which transit system (agency and mode) had the highest fare-box recovery(Ratio of Total Fares to Expenses)?

Using mutate to create another calculated column with the Fare Box Ratio formula we get the answer, Vanpool in the Central Kentucky agency. This makes sense as vanpool is essentially car pooling in small vehicles like vans and small buses for shorter distances. This allows for lower costs then other modes like for example Heavy Rail, which require lots of maintenance and overhead.

      USAGE_AND_FINANCIALS<-USAGE_AND_FINANCIALS %>% 
         group_by(`Agency Name`,`Mode`) %>% 
         mutate(fare_box_recovery = `Total Fares`/Expenses) %>%
         arrange(desc(fare_box_recovery))
         print(USAGE_AND_FINANCIALS)
# A tibble: 1,132 × 8
# Groups:   Agency Name, Mode [1,130]
   `NTD ID` Mode    annual_trips annual_vrm `Agency Name` `Total Fares` Expenses
      <dbl> <chr>          <dbl>      <dbl> <chr>                 <dbl>    <dbl>
 1    40191 Vanpool         9640      94027 Transit Auth…         97300    40801
 2    40034 Vanpool       395004    3091052 County of Mi…       1987879  1191874
 3    90233 Vanpool        70093     462346 Yuma County …        411216   279585
 4    20190 Ferryb…      3757873     504037 Port Imperia…      33443241 23417248
 5    11239 Ferryb…       878728     188694 Hyannis Harb…      25972659 18383764
 6    11238 Ferryb…        96707      56980 Bay State LLC       6287351  4672351
 7    20169 Commut…       403646    1259602 Trans-Bridge…      11325199  8495611
 8    40001 Inclin…       481957      20128 Chattanooga …       3005198  2290714
 9    66339 Vanpool       118780    1748901 New Mexico D…        757574   588830
10       12 Vanpool       189684    1582484 Municipality…       1400709  1105911
# ℹ 1,122 more rows
# ℹ 1 more variable: fare_box_recovery <dbl>

Which transit system (agency and mode) has the lowest expenses per UPT?

Similarly we can answer this question.

       USAGE_AND_FINANCIALS %>% 
           group_by(`Agency Name`,`Mode`) %>% 
           mutate(expenses_per_trip = Expenses/annual_trips) %>% 
           select(`Agency Name`,Mode,expenses_per_trip) %>% 
           arrange(expenses_per_trip)
# A tibble: 1,132 × 3
# Groups:   Agency Name, Mode [1,130]
   `Agency Name`                                         Mode  expenses_per_trip
   <chr>                                                 <chr>             <dbl>
 1 North Carolina State University                       Bus                1.18
 2 Anaheim Transportation Network                        Bus                1.28
 3 Valley Metro Rail, Inc.                               Stre…              1.49
 4 University of Iowa                                    Bus                1.54
 5 Chatham Area Transit Authority                        Ferr…              1.60
 6 Texas State University                                Bus                2.05
 7 South Florida Regional Transportation Authority       Bus                2.27
 8 University of Georgia                                 Bus                2.31
 9 Hillsborough Area Regional Transit Authority          Stre…              2.45
10 University of Michigan Parking and Transportation Se… Bus                2.52
# ℹ 1,122 more rows

Indicates that buses have a low cost per person.

Which transit system (agency and mode) has the highest total fares per UPT?

In this question, an extra step will be taken for later use to both easily answer more questions and also effectively produce some visualizations based on the given problems. Firstly, using mutate to calculate Fares/ UPT(Trips) grouped by the agency and mode will be enough to answer the question. The second additional step will be to make a new column that contains both the agency and the mode in one column. This will make it easier to get this information, and also will allow use to graph these grouped problems. To do this the use of the stringr library. This library is used for manipulating strings, and in this case it will be used to combine the 2 strings in both agent and mode columns.

library(stringr)
       fare_per_trip <- USAGE_AND_FINANCIALS %>% 
           group_by(`Agency Name`,`Mode`) %>% 
           mutate(fares_per_trip = `Total Fares`/annual_trips) %>% 
           mutate(Agent_Mode = str_c(`Agency Name`,Mode, sep = ", ")) %>% #joining the col strings with a comma and space
           arrange(desc(fares_per_trip)) %>% 
           select(`Agency Name`,fares_per_trip,Mode,Agent_Mode) %>%
           head(10)
       print(fare_per_trip)
# A tibble: 10 × 4
# Groups:   Agency Name, Mode [10]
   `Agency Name`                                 fares_per_trip Mode  Agent_Mode
   <chr>                                                  <dbl> <chr> <chr>     
 1 Altoona Metro Transit                                  660.  Dema… Altoona M…
 2 Alaska Railroad Corporation                            153.  Alas… Alaska Ra…
 3 Bay State LLC                                           65.0 Ferr… Bay State…
 4 Central Pennsylvania Transportation Authority           50.2 Dema… Central P…
 5 Hampton Jitney, Inc.                                    41.3 Comm… Hampton J…
 6 County of Placer                                        38.8 Comm… County of…
 7 Audubon Area Community Services, Inc.                   37.4 Dema… Audubon A…
 8 Lane Transit District                                   34.0 Dema… Lane Tran…
 9 Pennsylvania Department of Transportation               32.3 Comm… Pennsylva…
10 Hyannis Harbor Tours, Inc.                              29.6 Ferr… Hyannis H…

With the new column, removing agent and mode columns.

  fare_per_trip %>% 
     ungroup() %>% 
     select(Agent_Mode,fares_per_trip)
# A tibble: 10 × 2
   Agent_Mode                                                     fares_per_trip
   <chr>                                                                   <dbl>
 1 Altoona Metro Transit, Demand Response                                  660. 
 2 Alaska Railroad Corporation, Alaska Railroad                            153. 
 3 Bay State LLC, Ferryboat                                                 65.0
 4 Central Pennsylvania Transportation Authority, Demand Response           50.2
 5 Hampton Jitney, Inc., Commuter Bus                                       41.3
 6 County of Placer, Commuter Bus                                           38.8
 7 Audubon Area Community Services, Inc., Demand Response                   37.4
 8 Lane Transit District, Demand Response                                   34.0
 9 Pennsylvania Department of Transportation, Commuter Rail                 32.3
10 Hyannis Harbor Tours, Inc., Ferryboat                                    29.6

As you can see it is very simple to read

Since the Altoona per trip price is so high, some additional analysis is required.

  USAGE %>%
    group_by(Mode,Agency) %>% 
    filter(Agency == 'Altoona Metro Transit'& year == 2022) %>% 
    filter(Mode== "Demand Response") %>%
    summarise(total_trips = sum(trips))
`summarise()` has grouped output by 'Mode'. You can override using the
`.groups` argument.
# A tibble: 1 × 3
# Groups:   Mode [1]
  Mode            Agency                total_trips
  <chr>           <chr>                       <dbl>
1 Demand Response Altoona Metro Transit          26
USAGE_AND_FINANCIALS <- USAGE_AND_FINANCIALS %>% 
  mutate(Agent_Mode = str_c(`Agency Name`,Mode, sep = ", ")) #adding to main dataset

USAGE_AND_FINANCIALS %>%
    filter(Agent_Mode == 'Altoona Metro Transit, Demand Response') %>% 
    select(Agent_Mode,`Total Fares`)
Adding missing grouping variables: `Agency Name`, `Mode`
# A tibble: 1 × 4
# Groups:   Agency Name, Mode [1]
  `Agency Name`         Mode            Agent_Mode                 `Total Fares`
  <chr>                 <chr>           <chr>                              <dbl>
1 Altoona Metro Transit Demand Response Altoona Metro Transit, De…         17163

Given that there is in fact 26 recorded trips with total fares of 17,163 there might be a chance that some info is missing, and if this is the case then the Alaskan Railroad would be the answer to this problem.

Graphical Representation

 ggplot(fare_per_trip,aes(x = Agent_Mode,y = fares_per_trip))+
           geom_bar(stat = "identity",aes(fill= Agent_Mode),color= 'purple',show.legend = FALSE)+
           theme_minimal()+
           theme(axis.text.x = element_text(angle = 45,vjust =1,hjust = 1))+
           labs(title = "Fairs Per UPT(Unlinked Passenger Trips)", x = "Agency & Mode",y = "Fares Per Passenger Trips")

Which transit system (agency and mode) has the lowest expenses per VRM?

We can use that new column along with arrange to get the lowest expense per VRM.

         USAGE_AND_FINANCIALS %>% 
           group_by(Agent_Mode) %>% 
           mutate(expenses_per_vrm = Expenses/annual_vrm) %>% 
           arrange(expenses_per_vrm) %>% 
           select(Agent_Mode,expenses_per_vrm) %>% 
           head(5)
# A tibble: 5 × 2
# Groups:   Agent_Mode [5]
  Agent_Mode                                                expenses_per_vrm
  <chr>                                                                <dbl>
1 New Mexico Department of Transportation, Vanpool                     0.337
2 VIA Metropolitan Transit, Vanpool                                    0.370
3 County of Miami-Dade, Vanpool                                        0.386
4 County of Volusia, Vanpool                                           0.393
5 Corpus Christi Regional Transportation Authority, Vanpool            0.431

Using arrange(desc( ) ) gives us the ability to get the transit system with the highest expense per VRM.

         USAGE_AND_FINANCIALS %>% 
           group_by(Agent_Mode) %>% 
           filter(annual_vrm != 0) %>% # to avoid dividing by 0
           mutate(expenses_per_vrm = Expenses/annual_vrm) %>% 
           arrange(desc(expenses_per_vrm)) %>% 
           select(Agent_Mode,expenses_per_vrm) %>% 
           head(5)
# A tibble: 5 × 2
# Groups:   Agent_Mode [5]
  Agent_Mode                                                    expenses_per_vrm
  <chr>                                                                    <dbl>
1 Altoona Metro Transit, Demand Response                                   1207.
2 New York City Department of Transportation, Ferryboat                     771.
3 New Orleans Regional Transit Authority, Ferryboat                         468.
4 Loop Trolley Transportation Development District, Streetcar …             412.
5 Washington State Ferries, Ferryboat                                       383.

Notice there are several Ferryboats in this descending table. This indicates Ferries are more costly to operate then some of the other Modes like Vanpool, which does make sense.

Which transit system (agency and mode) has the highest total fares per VRM?

         fares_vrm<-USAGE_AND_FINANCIALS %>%
           group_by(Agent_Mode) %>% 
           filter(annual_vrm != 0) %>% # to avoid dividing by 0
           mutate(fares_per_vrm = `Total Fares`/annual_vrm) %>%
           arrange(desc(fares_per_vrm)) %>% 
           select(Agent_Mode,fares_per_vrm) %>% 
           head(10)
         print(fares_vrm)
# A tibble: 10 × 2
# Groups:   Agent_Mode [10]
   Agent_Mode                                                      fares_per_vrm
   <chr>                                                                   <dbl>
 1 Chicago Water Taxi (Wendella), Ferryboat                                237. 
 2 Altoona Metro Transit, Demand Response                                  229. 
 3 Jacksonville Transportation Authority, Ferryboat                        158. 
 4 Chattanooga Area Regional Transportation Authority, Inclined P…         149. 
 5 Hyannis Harbor Tours, Inc., Ferryboat                                   138. 
 6 SeaStreak, LLC, Ferryboat                                               115. 
 7 Bay State LLC, Ferryboat                                                110. 
 8 Cape May Lewes Ferry, Ferryboat                                          93.0
 9 Woods Hole, Martha's Vineyard and Nantucket Steamship Authorit…          91.7
10 Washington State Ferries, Ferryboat                                      78.1

This result shows many Ferryboat modes, which makes sense as ferry’s will be traveling shorter distances while charging the same or more than longer forms of transport.

Graphical Representation

 ggplot(fares_vrm,aes(x = Agent_Mode,y = fares_per_vrm))+
           geom_bar(stat = "identity",aes(fill= Agent_Mode),color= 'lightblue',show.legend = FALSE)+
           theme_bw()+
           theme(axis.text.x = element_text(angle = 45,vjust =1,hjust = 1))+
           labs(title = "Fairs Per VRM(Vehicle Revenue Miles)", x = "Agency & Mode",y = "Fares Per VRM")